METHOD AND SYSTEM FOR
ENVIRONMENTALLY ADAPTIVE FAULT 
TOLERANT COMPUTING 

GOVERNMENT RIGHTS

The United States Government may have acquired certain
rights in this invention pursuant to Contract No. NM07 10209
awarded by NASA.

BACKGROUND 

I. Field of the Invention 

The present invention is directed to mitigating radiation-induced
faults. More particularly, the present invention is directed to a
method and/or system for handling an inherent susceptibility of
Commercial-Off-The-Shelf (“COTS”) components to Single Event
Upsets (“SEUs”). The invention is particularly useful in providing
real-time environmental sensing, utilizing a COTS-based computer
architecture that supports adaptable configuration levels of fault
tolerance, thereby increasing performance and efficiency while
maintaining reliable operation. However, aspects of the invention
may be equally applicable in other scenarios as well.

II. Description of Related Art

Science and defense missions alike have increasing demands for
data returns from their spaceborne assets. In more recent times,
there has been an increase in the capability of the instruments
deployed in space. For example, such an increase has been
discussed in the following references, which are herein entirely
incorporated by reference and to which the reader is directed for
further information: “An Overview of Earth Science Enterprise”,
NASA Goddard Space Flight Center, FS-2002-3-040-GSFC, March
2002; Wallace M. Porter and Harry T. Enmark, “A System Overview
of The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS)”,
JPL Pasadena, Calif.; and H. L. Huang, “Data Compression of
High-spectral Resolution Measurements”, Satellite Direct Readout
Conference for the Americas, December 2002.

The typical approach of data gathering, data compression, and
data transmission no longer appears sustainable. It is difficult to
transmit a vast amount of data via available downlink channels in
a reasonable period of time. One proposed solution is to reduce
demand on the downlink by moving processing away from Earth and
onto the spaceborne asset.

However, there are certain limitations to such an approach. For
example, this approach is hampered by the limited capabilities of
conventional on-board processors. It is also hampered by the
prohibitive cost of developing radiation hardened high-performance
electronics. Such issues are discussed in J. Marshall and R.
Berger, “A Processor Solution for the Second Century of Powered
Space Flight,” Digital Avionics Systems Conferences, 2000.
Proceedings. DASC. The 19th, Volume: 2, 7-13 Oct. 2000, Pages:
8.A.2_1-8.A.2_8, and Gary R. Brown, “Radiation Hardened PowerPC
603e™ Based Single Board Computer,” 20th Digital Avionics
Systems, 2001, October 2001, herein entirely incorporated by
reference and to which the reader is directed for further
information.

Based in part on these perceived concerns, the relevant industry
has considered the use of COTS components. Such considerations
are generally described in E. R. Prado et al., “A Standard Approach
to Spaceborne Payload Data Processing,” IEEE Aerospace
Conference, March 2001, herein entirely incorporated by reference
and to which the reader is directed for further information.
Furthermore, a more recent adoption of silicon-on-insulator
(“SOI”) technology by COTS integrated circuit foundries has also
resulted in devices with moderate space radiation tolerance. See,
e.g., F. Irom et al., “Single-Event Upset in Evolving Commercial
Silicon-on-Insulator Microprocessor Technologies,” Nuclear and
Space Radiation Effects Conference 2003, and Xilinx Corporation,
“QPro Virtex 2.5V Radiation Hardened FPGA,” Xilinx Web site
http://www.xilinx.com/, November 2001, herein entirely
incorporated by reference and to which the reader is directed for
further information.

Despite such progress, COTS components continue to be highly
susceptible to SEUs. One popular approach for mitigating such
SEUs is to employ fixed component level redundancy. See, e.g.,
Daniel P. Siewiorek and Robert S. Swarz, Reliable Computer
Systems: Design and Evaluation, 3rd edition, MA: AK Peters Ltd.,
1998, herein entirely incorporated by reference and to which the
reader is directed for further information. However, one
disadvantage of utilizing fixed component level redundancy is its
low efficiency and its unrealized system capacity.

Certain conventional onboard processing computers consist mostly
of radiation hardened components based on COTS equivalents.
Though COTS compatibility offers certain perceived benefits,
including adoption of commercial software, large amounts of
Non-Recurring Engineering (“NRE”) are typically required for an
initial silicon implementation. Additionally, radiation hardened
components often lag their commercial counterparts in overall
performance and capability by at least 1 to 2 orders of magnitude.
A number of factors contribute to this deficiency. One such factor
is that radiation-hardening techniques for microelectronics
require the use of fixed transistor or gate level redundancy. This
additional logic increases the power required to perform the same
unit of computation.

An approach towards improvement concerns the use of 
true COTS microprocessors and Field Programmable Gate 
Arrays (“FPGAs”). Typically, such an approach avoids the 
high cost and long development time associated with radia- 
tion hardened equivalents. However, true COTS devices are 
typically quite susceptible to SEUs. One popular SEU miti- 
gation approach is to use component level N-module redun- 
dancy. However, such N-module redundancy often results in 
low efficiency and low capacity due to an overhead that often 
approaches 2×, or more.

Furthermore, the level of redundancy is fixed and is often
unnecessary. To overcome the deficiencies of fixed redundancy, two
characteristics of space missions may be exploited: first, the
variability of the space environment and second, task level
criticality. Most missions will have a mix of processes with
varying criticality. This characteristic of mission processing can
be exploited to increase a system’s efficiency by applying
redundancy at a task level. Furthermore, the space environment is
variable, and this variability makes the necessary redundancy
dependent on time and orbital position.

There is, therefore, a general need for a method and/or system for
the mitigation of radiation-induced faults such as SEUs. There is
also a general need for a method and system that can utilize lower
cost COTS components in space which exhibit acceptable overall
Total Ionizing Dose (“TID”) and Latch Up characteristics, but are
still susceptible to SEUs. A further need exists for a system
and/or method that facilitates the use of COTS components in SEU
abundant environments, while also maintaining adequate levels of
system efficiency and capacity.

There is a further need for such systems and methods to achieve
adequate levels of system efficiency and capacity by adaptively
configuring a level of fault tolerance in a system as mandated by a
mission environment and/or a mission application. Consequently,
there is a general need for real-time environmental sensing,
utilizing a COTS-based computer architecture that supports
adaptable configuration levels of fault tolerance, thereby
optimizing performance and efficiency while maintaining reliable
operation.

SUMMARY 

According to an exemplary embodiment, a method of adapting fault
tolerant computing comprises the steps of measuring an
environmental condition representative of an environment;
analyzing an on-board processing system’s sensitivity to the
measured environmental condition; and determining whether to
reconfigure a fault tolerance of the on-board processing system
based in part on the measured environmental condition.

In an alternative embodiment, a system for environmentally
adaptive fault tolerant computing (“EAFTC”) comprises a sensor
that senses a characteristic of a dynamic environment and
generates an output signal based on the characteristic. A system
configuration controller receives the output signal, the controller
assessing a potential environmental threat to an availability of
the system based in part on the output signal. A computing device
receives an input from the controller. A configuration of the
computing device is adapted to effectively mitigate the potential
environmental threat to the system’s availability.

These as well as other advantages of various aspects of the
present invention will become apparent to those of ordinary skill
in the art by reading the following detailed description, with
appropriate reference to the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS 

An exemplary embodiment of the present invention is 
described herein with reference to the drawings, in which: 

FIG. 1 illustrates one arrangement of an EAFTC based
system incorporating aspects of the present invention;

FIG. 2 illustrates one arrangement of a target computer that 
may be utilized with the EAFTC based system illustrated in 
FIG. 1; 

FIG. 3 illustrates one arrangement of an adaptive processing
computer that may be utilized with the target computer
illustrated in FIG. 2; 

FIG. 4 illustrates one arrangement of a rapid I/O system 
that may be utilized with the target computer illustrated in 
FIG. 2;

FIG. 5 illustrates one arrangement of an alarm module that 
may be utilized with the target computer illustrated in FIG. 2; 

FIG. 6 illustrates a software framework for the target com- 
puter illustrated in FIG. 2; 

FIG. 7 illustrates an exemplary block diagram of the 
EAFTC controller illustrated in FIG. 1; 

FIG. 8 illustrates an exemplary block diagram of reliable 
middleware that may be utilized with the EAFTC controller 
illustrated in FIG. 1; and

FIG. 9 illustrates one example of applying the EAFTC 
system illustrated in FIG. 1. 


A. General Overview of EAFTC System 

FIG. 1 illustrates an exemplary block diagram of a first 
arrangement for an EAFTC based system 10. Preferably, 
EAFTC based system 10 employs a system level fault toler- 
ance based on historical and/or environmental conditions. 
EAFTC system 10 comprises an EAFTC controller 12, an 
environmental sensor suite 14, and a target computer 16. 
EAFTC controller 12 comprises history 18 and a deployment 
plan 20. Sensor suite 14 preferably comprises a plurality of 
sensors including but not limited to a SEU alarm 22, environ- 
ment measurement 24, and the spacecraft 26. Other sensor 
suite arrangements are also possible. 

A preferred process implemented by the arrangement illustrated in
FIG. 1 includes the following steps. First, sensor suite 14 senses
an environmental condition. For example, sensor suite 14 can
provide an energy level indication 32 from SEU alarm 22, a sensor
response 34 from environmental measurement 24, or alternatively
ephemeris 36 from spacecraft 26. Once such signal or signals are
received, EAFTC controller 12 evaluates whether the environmental
condition poses a threat to the availability of system 10. If
EAFTC controller 12 determines that such an environmental threat
exists, system 10 then adapts (if deemed necessary) a
configuration of target computer 16. In this manner, system 10
effectively and dynamically mitigates potential threats presented
by the environment. As seen from FIG. 1, the direction of data
flow 38 proceeds from sensor suite 14 through EAFTC controller 12
and then towards target computer 16.
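
The sense, assess, and adapt flow described above can be sketched in a few lines of Python. This is only an illustrative sketch: the field names, the alarm threshold, and the two configuration labels ("simplex" and "redundant") are assumptions introduced here and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class SensorReadings:
    """Hypothetical snapshot of what the sensor suite reports."""
    seu_alarm_level: float      # energy level indication from the SEU alarm (arbitrary units)
    in_high_flux_region: bool   # assumed flag derived from spacecraft ephemeris


def assess_threat(readings: SensorReadings, alarm_threshold: float = 1.0) -> bool:
    """Return True when the readings suggest a threat to system availability."""
    return readings.seu_alarm_level >= alarm_threshold or readings.in_high_flux_region


def control_step(readings: SensorReadings, current_config: str):
    """One pass of sense -> assess -> adapt.

    Returns the configuration the target computer should use and whether a
    reconfiguration is actually needed (adaptation happens only if necessary).
    """
    wanted = "redundant" if assess_threat(readings) else "simplex"
    return wanted, wanted != current_config
```

For instance, `control_step(SensorReadings(3.5, False), "simplex")` requests a switch to the redundant configuration, while a quiet reading leaves the current configuration untouched.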

In general, EAFTC controller 12 may be implemented to accept
various different environmental inputs from sensor suite 14 that
can induce faults in target computer 16, such as a payload
computer system. However, for the particular arrangement presently
discussed herein, environmental monitoring may be focused on
measurements of high-energy particle flux such as may occur in a
spaceborne asset. For example, where the EAFTC system 10 is
provided in a spacecraft, monitoring the flux of high-energy
particles makes it possible to assess the system’s overall
susceptibility to SEUs. However, those of ordinary skill in the art
will recognize that alternative measurement and system
arrangements and/or alternative environmental inputs may also be
utilized.

Returning to FIG. 1, sensor measurements (e.g., temperature,
available power, etc.) and a state of health of target computer 16
are continuously monitored by EAFTC controller 12 via health
signal 42. Such information and data 42 are combined with a
mission defined application task deployment plan. Preferably, the
mission defined application task deployment plan contains task
level criticality requirements as well as other pertinent
information used by EAFTC controller 12. Based on that input,
EAFTC controller 12 determines whether the present environment in
which the asset resides poses any reliability and/or availability
threat to target computer 16 and, preferably, the level of such
threat.

EAFTC controller 12, which acts as a system configuration
controller, then generates the requisite signals by way of process
deployment 40, which are then sent to adapt target computer 16. In
this manner, the process deployment 40 will counter a potential
hostile environmental threat to computer 16. Based on the threat
assessment, the system configuration controller 12 reconfigures
the on-board processing system fault tolerance to match the threat
level. The on-board processing system preferably implements
configurable fault tolerance that matches the variable threats that
will be encountered by the system. In response, target computer 16
optimally employs the requested fault tolerant mechanism. This
process is performed in real-time and on-line as an integral part
of overall operation of system 10.
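
As a rough illustration of matching fault tolerance to both the threat level and task level criticality, the following Python sketch maps each task to a number of redundant replicas. The three-level scales and the scoring policy are hypothetical; the specification does not define a particular mapping.

```python
from enum import IntEnum
from typing import Dict


class Threat(IntEnum):
    LOW = 0
    MODERATE = 1
    HIGH = 2


class Criticality(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def replicas_for(threat: Threat, criticality: Criticality) -> int:
    """Hypothetical policy: more replicas as threat and criticality rise."""
    score = int(threat) + int(criticality)
    if score >= 3:
        return 3   # triple redundancy
    if score == 2:
        return 2   # duplex (compare outputs)
    return 1       # simplex, no redundancy


def deployment(tasks: Dict[str, Criticality], threat: Threat) -> Dict[str, int]:
    """Build a task -> replica-count deployment for the current threat level."""
    return {name: replicas_for(threat, crit) for name, crit in tasks.items()}
```

Under this assumed policy, a benign environment lets a low-criticality task run on a single processor, leaving capacity that fixed triple redundancy would have consumed.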

Hardware Implementation 

As may be seen from FIG. 1, EAFTC controller 12 receives certain
commands from target computer 16 by way of health signal 42. In
one preferred arrangement, hardware for target computer 16 may
comprise Honeywell’s Integrated Payload System. The Honeywell
Integrated Payload System is essentially a cluster computer
consisting of a multitude of data processors and one cluster
manager.

FIG. 2 illustrates one arrangement of a target computer 50 that
may be utilized with EAFTC system 10 illustrated in FIG. 1. In
this arrangement, target computer 50 comprises various hardware
elements including system controllers 52a and 52b, a plurality of
data processors 64a, 64b, 64c, and 64d, a first packet switched
fabric 62a, a second packet switched fabric 62b, and an
environmental sensor suite 58. A power supply 56 is also provided.

A. System Controller 

System controller 52 for target computer 50 is preferably
implemented using redundant, Radiation Hardened Single Board
Computers. Such a reliable radiation hardened system controller 52
provides a platform for deployment of critical control software
such as the EAFTC controller. For example, in one arrangement, a
potential candidate for a system controller 52 may comprise a
Honeywell radiation hardened RHPPC Single Board Computer (“SBC”).
See, for example, the description as provided by Gary R. Brown,
“Radiation Hardened PowerPC 603e Based Single Board Computer,”
IEEE Aerospace Conference, 2001 (http://cism.jpl.nasa.gov/ 
events/ seminardocs/Big_sky__08 02 01 .pdf). 

In one preferred arrangement, the radiation hardened SBC is based
on Motorola 603e microprocessor technology. Such a radiation
hardened SBC is generally described in Gary R. Brown, “Radiation
Hardened PowerPC 603e™ Based Single Board Computer,” 20th Digital
Avionics Systems, 2001, October 2001, herein entirely incorporated
by reference and to which the reader is directed for further
information. The use of an RHPPC SBC may be preferred for a number
of reasons. Some of these reasons are summarized in Table 1
provided below:

TABLE 1

RHPPC SBC Features

Salient Features
3.3 V and 5.0 V Power
RHPPC delivering 100 MIPS
Peripheral Enhancement Component support chip
4 MB EEPROM with Single Error Correction and Double Error Detection
512 KB EEPROM
128 MB DRAM with SuperEDAC
6U x 220 mm Euro Card Form Factor
Max Power Draw 15 W
Mass >3 lbs
Redundant 1553 (interface to spacecraft computer)
32-bit 33 MHz PCI (interface to cluster and MIB electronics)


B. Data Processors 

As illustrated in FIG. 2, target computer 50 further comprises a
plurality of data processors 64. In this preferred arrangement, the
plurality of data processors comprises a first, a second, a third,
and a fourth data processor 64a, 64b, 64c, and 64d, respectively.
In one preferred arrangement, these data processors comprise COTS
based processors. More preferably, these data processors comprise
COTS based processors comprising a unique architecture herein
referred to as an Adaptive Processing Computer (“APC”). The APC is
a multi-mode device that combines the use of COTS microprocessors
and FPGAs on a single platform. In one arrangement, the APC employs
a COTS IBM PowerPC 750FX microprocessor and a Xilinx VirtexII 6000
FPGA. The IBM 750FX and Xilinx VirtexII devices are suitable COTS
devices for a flight experiment.

C. Adaptive Processing Computer 

FIG. 3 illustrates one arrangement of an adaptive process- 
ing computer (“APC”) 80 that may be utilized with target 
computer 50 illustrated in FIG. 2. APC 80 comprises a COTS 
compute resources portion 82 and a portion comprising a 
radiation hardened configuration manager 84 along with sup- 
porting functions. Configuration manager 84 handles various 
functions including but not limited to mode changes of APC 
80, basic FPGA configuration, FPGA configuration memory 
scrubbing, low-level health monitoring, and power mode 
control. 

In one preferred arrangement, APC 80 may implement a plurality of
modes of operation. For example, APC 80 may implement a
microprocessor mode, a custom processor mode, and a hybrid
processor mode. The mode of operation may be determined by the
active configuration of an FPGA labeled Processing
Element/Processor Controller (“PE/PC”) 88 in FIG. 3.

1. Microprocessor Mode

APC 80 may be configured in a microprocessor mode. While in this
mode, the APC’s FPGA is configured as a Processor Controller and
the microprocessor is enabled. As such, the APC behaves much like
an SBC. The Processor Controller FPGA hosts all of the support
functions for the PPC, including I/O, memory controller,
interrupts, timers, etc.

2. Custom Processor Mode

When APC 80 is enabled as a custom processor, the microprocessor is
disabled and does not execute software. While APC 80 is in this
custom processor mode, the FPGA of PE/PC 88 is configured as a
Processing Element and hosts a full-custom application including
all I/O and processing logic. The processing logic in the
Processing Element is defined by an image loaded into the FPGA’s
configuration memory by configuration manager 84. Configuration
manager 84 receives commands from software on system controller 52
of target computer 50 (see FIG. 2).

3. Hybrid Mode 

The third APC capability is a hybrid mode of operation. In the
hybrid mode, the FPGA hosts the processor controller for the
microprocessor as well as application specific modules. This third
alternative mode can be likened to a co-processor system. The
application specific modules could be Digital Signal Processing
(“DSP”) functions, data compression, vector processors, etc. As
with the custom mode, the use of application specific modules may
result in high efficiency and performance yields. For example, such
efficiencies and performance yields are generally described in J.
S. Donaldson, “Push the DSP Performance Envelop,” Xilinx Xcell
Journal, Spring 2003, herein entirely incorporated by reference and
to which the reader is directed for further information. This third
mode also offers additional flexibility by retaining a programmable
microprocessor and access to custom hardware. The APC is also
capable of dynamic switching between these modes. Such a feature
may prove useful in many applications. For instance, if multiple
data channels are part of the same payload, then the APC’s
operating mode can be switched to better serve the needs of the
active data channel.
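
The three modes and the dynamic switching between them can be modelled with a small Python sketch. The channel names and the channel-to-mode table are hypothetical examples; only the three modes and the FPGA roles come from the description above.

```python
from enum import Enum


class ApcMode(Enum):
    MICROPROCESSOR = "microprocessor"
    CUSTOM = "custom"
    HYBRID = "hybrid"


# Hypothetical mapping from payload data channel to the mode that suits it.
CHANNEL_MODE = {
    "control": ApcMode.MICROPROCESSOR,
    "imaging": ApcMode.CUSTOM,
    "spectral": ApcMode.HYBRID,
}

# Role of the PE/PC FPGA in each mode, as described in the text.
FPGA_ROLE = {
    ApcMode.MICROPROCESSOR: "processor controller",
    ApcMode.CUSTOM: "processing element",
    ApcMode.HYBRID: "processor controller plus application specific modules",
}


class Apc:
    """Toy model of an APC that switches mode per active data channel."""

    def __init__(self) -> None:
        self.mode = ApcMode.MICROPROCESSOR
        self.reconfigurations = 0

    def microprocessor_enabled(self) -> bool:
        # The microprocessor is disabled only in the custom mode.
        return self.mode is not ApcMode.CUSTOM

    def switch_for_channel(self, channel: str) -> ApcMode:
        wanted = CHANNEL_MODE[channel]
        if wanted is not self.mode:      # reconfigure only on an actual change
            self.mode = wanted
            self.reconfigurations += 1
        return self.mode
```

Counting reconfigurations separately from channel switches reflects that an FPGA image load is only needed when the requested mode differs from the active one.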

The APC’s flexibility allows one to adapt the target processor to a
variety of mission level requirements. As just one example,
enhanced efficiency may be achieved by using more custom hardware
modules in the FPGA. Similarly, enhanced processing performance may
also be realized in FPGA modules. However, for certain applications
that may require enhanced programmability, the microprocessor mode
might be a more suitable choice. Utilizing an APC can facilitate
these needs. Moreover, such implementation alternatives are not
typically available in on-board processor modules. An example of
the APC’s flexibility is a processing situation where there is a
mix of control flow as well as data flow processing on the same
computer. Control flow applications are generally more likely to be
sequential, whereas data flow tends to be more parallel. In the
case of sequential applications, a microprocessor may yield
acceptable performance results. However, parallel applications can
better use the FPGA co-processor to accelerate their processing.

Certain relevant features of a preferred APC, such as APC 
80, are provided below in Table 2. 

TABLE 2

APC Features

Features
750FX @ 650 MHz Delivering 1300 MIPS
VirtexII 6000 Processing Element/Processor Controller
PCI 32-bit 33 MHz
RapidIO
128 MB DRAM with Super EDAC
4 MB EEPROM with SECDED EDAC
Configuration Manager with support FPGA SEU mitigation
PCI-to-PCI bridge facilitating a local PCI bus
Ethernet development interface
6U x 220 mm Euro Card Form Factor
Mass <3 lbs
Max Power Draw 20 W

Returning to FIG. 2, target computer 50 further comprises a packet
switched fabric A 60 and a packet switched fabric B 62.
Preferably, the various modules comprising system 50 are
interconnected via a packet switched fabric based on the RapidIO
(“RIO”) industry standard. For additional information on this
industry standard, the reader is directed to the RapidIO Trade
Association Web site at http://www.rapidio.org/, herein entirely
incorporated by reference.

RIO is an industry standard and is generally recognized as one of
the more popular conventional COTS interconnects. Certain
conventional payload data processor interconnects are based upon
multi-drop configurations. Such multi-drop configurations include
but are not limited to MODULE BUS, PCI and VME. Such multi-drop
systems distribute the available bandwidth over each module.
However, this sharing may produce points of contention among
participant nodes, often resulting in system level bottlenecks.

In contrast to such multi-drop systems, RIO implements a
packet-switched, point-to-point interconnect. Such an interconnect
has certain advantages. For example, packet-switched,
point-to-point interconnects allow multiple full-bandwidth
point-to-point links to be simultaneously established between
end-nodes in a network. Another advantage of packet-switched,
point-to-point interconnects is that they reduce contention while
also delivering more bandwidth to an application.

FIG. 4 illustrates one arrangement of a RapidIO (“RIO”) system 100
that may be utilized with target computer 50 illustrated in FIG. 2.
RIO system 100 comprises sensor data 116, two processors 102 and
104, a RapidIO switch 108, bulk memory 110, general purpose I/O
114, a backplane 106, and non-volatile memory 112.

RIO system 100 comprises essentially two building blocks: a RIO
end-node 120 and a RIO switch 122. Each end-node 120, 122 in RIO
system 100 comprises a RIO network interface. Each RIO network
interface comprises a point-to-point link to shared RIO switch 108.
RIO switch 108 receives packets and routes them to the appropriate
destination over backplane 106. The non-blocking nature of RIO
allows concurrent routing of multiple packets. For example, sensor
data 116 may be stored in bulk memory 110 at the same time as
processors 102, 104 access general purpose I/O 114. By using
multiple switches, as illustrated in FIG. 4, in the EAFTC system 10
of FIG. 1, topologies consisting of hundreds or thousands of nodes
may be achieved.
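
To illustrate why a non-blocking switch helps, the following Python sketch compares how many time slots a set of transfers needs on a switched fabric versus a shared multi-drop bus. It is a deliberately simplified model under stated assumptions: every transfer takes one slot, an end-node can take part in only one transfer per slot, and the node names are made up for the example.

```python
from typing import Iterable, List, Set, Tuple

Transfer = Tuple[str, str]  # (source end-node, destination end-node)


def slots_switched(transfers: Iterable[Transfer]) -> int:
    """Greedy slot count when transfers with disjoint end-nodes run concurrently."""
    slots: List[Set[str]] = []           # end-nodes busy in each slot
    for src, dst in transfers:
        for busy in slots:
            if src not in busy and dst not in busy:
                busy.update((src, dst))  # fits alongside the existing transfers
                break
        else:
            slots.append({src, dst})     # needs a new slot
    return len(slots)


def slots_shared_bus(transfers: Iterable[Transfer]) -> int:
    """On a shared bus only one transfer can use the medium at a time."""
    return len(list(transfers))
```

With the example from the text, storing sensor data in bulk memory and a processor accessing general purpose I/O touch disjoint end-nodes, so the switched model completes both in one slot while the bus model needs two.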

In one preferred arrangement, RIO interfaces are based on
low-voltage differential signaling (“LVDS”) technology and can
achieve bandwidths of up to 60 Gbits/s for each active link. A 16
bit RIO system with two active point-to-point links is capable of
120 Gbits/s, providing a >120x performance increase over a 33 MHz
32 bit Compact PCI based system.

One benefit of the RIO protocol is its error detection and recovery
mechanism. By combining retry protocols, cyclic redundancy codes
(“CRC”) and single/multiple error detection, RIO handles all
in-network errors without application intervention. This inherent
error handling and recovery capability proves beneficial for
applications that require a highly reliable interconnect, such as
space applications.
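
The general idea of combining a CRC with a retry protocol can be sketched as follows. This is a generic illustration and not the RapidIO packet format: it uses the CRC-CCITT routine from Python's standard `binascii` module, and the framing, retry limit, and the `link` callable are assumptions made for the example.

```python
import binascii
from typing import Callable, Tuple


def frame(payload: bytes) -> bytes:
    """Append a 16-bit CRC (CRC-CCITT, initial value 0xFFFF) to the payload."""
    crc = binascii.crc_hqx(payload, 0xFFFF)
    return payload + crc.to_bytes(2, "big")


def check(framed: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the received CRC."""
    payload, received = framed[:-2], int.from_bytes(framed[-2:], "big")
    return binascii.crc_hqx(payload, 0xFFFF) == received


def send_with_retry(payload: bytes,
                    link: Callable[[bytes], bytes],
                    max_attempts: int = 3) -> Tuple[bytes, int]:
    """Send over `link` (a model of the channel) and retry on CRC failure.

    Returns the delivered payload and the number of attempts used, so the
    caller never sees the corrupted frames.
    """
    for attempt in range(1, max_attempts + 1):
        received = link(frame(payload))
        if check(received):
            return received[:-2], attempt
    raise IOError("link failed after %d attempts" % max_attempts)
```

Because detection and retransmission happen inside `send_with_retry`, the application-level caller only receives verified data, which mirrors the "without application intervention" property described above.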

Environmental Sensor Suite 

Returning to FIG. 2, target computer 50 further comprises an
environmental sensor suite 58. EAFTC system 10 relies, to a
certain extent, on an ability to sense its environment. As part of
PSI’s Reconfigurable Environmentally-Adaptive Computing Technology
(“REACT”), a miniature embedded radiation monitor, the SEU Alarm,
has been developed. The SEU Alarm is based on certain
flight-proven technology originally developed for PSI’s radiation
diagnostic instrumentation. General background information on this
radiation diagnostic instrumentation may be obtained from the
Physical Sciences Inc. Web site http://www.psicorp.com/index.shtml,
herein entirely incorporated by reference and to which the reader
is directed for further information. Advantages of the SEU Alarm
over conventional sensors are its relatively small footprint and
that it is designed to support SEU rate predictions.

In one arrangement, the SEU alarm (shown as alarm 22 in FIG. 1)
provides continuous monitoring of proton and heavy-ion fluxes that
cause single event upsets. In one preferred arrangement, the SEU
alarm comprises a small block of scintillators coupled to a
photo-detector. For example, FIG. 5 illustrates one such
arrangement of a SEU alarm module 150. Module 150 comprises three
sensors 152, 154, 156, respectively coupled to three controller
electronics 160, 162, and 164. Module 150 further comprises a
controller 166 and a network interface 168. Controller 166 provides
the control and interface registers for the software interface to
the sensor modules. Software configures each sensor for a given
application by setting alarm thresholds and refresh rates. Software
can also access the alarm measurements for use in evaluating the
threat to the system.
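
A minimal Python sketch of the software-visible side of such a module is shown below. The class and method names, the units, and the default values are hypothetical; the description only states that software sets alarm thresholds and refresh rates and reads back measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SensorChannel:
    """Hypothetical per-sensor register set exposed by the module controller."""
    alarm_threshold: float         # measurement level that raises the alarm
    refresh_hz: float              # how often the measurement is refreshed
    last_measurement: float = 0.0


class SeuAlarmModule:
    """Toy model of an alarm module with several sensors (three in FIG. 5)."""

    def __init__(self, sensor_count: int = 3) -> None:
        # Until configured, a channel never alarms (threshold is infinite).
        self.channels = [SensorChannel(float("inf"), 1.0) for _ in range(sensor_count)]

    def configure(self, index: int, alarm_threshold: float, refresh_hz: float) -> None:
        channel = self.channels[index]
        channel.alarm_threshold = alarm_threshold
        channel.refresh_hz = refresh_hz

    def record(self, index: int, measurement: float) -> None:
        """Stand-in for the hardware updating a measurement register."""
        self.channels[index].last_measurement = measurement

    def read(self, index: int) -> float:
        return self.channels[index].last_measurement

    def alarms(self) -> List[bool]:
        return [c.last_measurement >= c.alarm_threshold for c in self.channels]
```

Keeping the threshold per channel lets software tune each sensor to a given application, as the text describes, while the raw measurement stays available for threat evaluation.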

SEU alarm 150, by way of sensors 152, 154, 156, provides
continuous monitoring of the proton and heavy-ion fluxes that
cause single event upsets. The basic components of the SEU Alarm
are a small block of scintillators coupled to a photo-detector. In
one preferred arrangement, a number of these devices can be
consolidated onto a single module.

Software Framework 

FIG. 6 illustrates a preferred software framework 180 for a target
computer, such as the target computer 16 illustrated in FIG. 1.
Software framework 180 comprises an operating system/system
software, a fault tolerant system controller/node, EAFTC
controller 192, messaging middleware 200, and reliable platform
middleware 216. One objective of the target computer software
framework is to provide system developers with a stable yet
familiar software platform. In FIG. 6, the software comprises
mission specific payload control and communications 196 hosted on
system controller 194, and application processes distributed
across data processor cluster 181. These software components may be
developed using COTS environments and associated Application
Program Interfaces (“APIs”).

In one preferred arrangement, the proposed operating systems are
VxWorks 202 for System Controller 194 and Linux for Data Processor
cluster 181. Information on the VxWorks operating system may be
found at the Wind River Systems Web site
http://www.windriver.com/, which is herein entirely incorporated by
reference and to which the reader is directed for further
information.

VxWorks OS 202 provides the capabilities necessary for the
deployment of real-time control processes such as those implemented
by EAFTC controller 192, fault tolerant system controller 194, and
payload control and communications 196. VxWorks OS 202 also
provides a familiar platform for developers of these types of
applications. Data processor cluster 181, unlike system controller
194, is the domain of the science application developer. In this
case, Linux OS 220 is a preferred OS due to its popularity in the
scientific community. To mitigate concerns associated with the
interaction of heterogeneous operating systems, a COTS messaging
middleware 214 may also be introduced. For example, the messaging
component of GoAhead’s SelfReliant Middleware provides a common
interface for communication between Linux OS 220 and VxWorks OS
202 along with a variety of practical messaging services such as
publish-subscribe and replicated databases. See, e.g., the GoAhead
Web site http://www.goahead.com/, which is herein entirely
incorporated by reference and to which the reader is directed for
further information.

Messaging within data processor cluster 181 may be accomplished via
Reliable Platform (“RP”) Middleware 216, which is also responsible
for the Software Implemented Fault Tolerance (“SIFT”) in the
cluster. See, e.g., C. J. Walter, P. Lincoln and N. Sun, “Formally
verified on-line diagnosis,” IEEE Trans. on Software Engr., vol.
23, #11, pp. 684-721, November 1997, which is herein entirely
incorporated by reference and to which the reader is directed for
further information. Together, the OS and middleware provide the
base platform on which other software may be implemented.

EAFTC and RP Middleware

In one preferred arrangement, EAFTC comprises essen- 
tially two software components: a Reliable Platform Middle- 
ware (RP) and an EAFTC controller. 

1. EAFTC Controller or System Configuration Controller

The EAFTC controller, or system configuration controller, provides
control of the EAFTC based system illustrated in FIG. 1. Since the
integrity and dependability of the EAFTC system rely on this
controller, its realization must be highly reliable. Hence, the
EAFTC controller may be implemented as a software component hosted
on a reliable system controller. One advantage of this
implementation is that it provides an enhanced level of
flexibility for future use and adaptations.

FIG. 7 depicts an overview 230 of internal functions of a 
system controller 270 in the context of a characteristic system 
implementation. The general description of the various components
comprising one preferred arrangement of a system controller is
provided below.

In one arrangement, a system controller 270 comprises an 
Environmental Server 242, Alert Level Generator 244, 
Deployment Plan 250, Deployment Generator 252, FPGA 
Configuration Controller 254, Health Monitor 256, and CPU 
Configuration Controller 258. Given a variety of possible 
sensory input, a function has been defined to collect and 
organize sensor signals into abstract representations that may 
be shared with other EAFTC components. Environmental 
Server 242 encapsulates the low-level interfaces to each of the 
sensors in the system, including the sampling of each signal. 
In the arrangement illustrated in FIG. 7, this would be from 
spacecraft 232 and SEU Alarm 234. 

Health Monitor 

Health Monitor 256 monitors a state 266 of each target system computer
resource 236. Signals such as heartbeats, redundant output consistency
mismatches, watchdog timeouts, etc., are collected via Fault Tolerant
Controller/Node components. These signals are then provided to Health
Monitor 256. Given predefined policies, Health Monitor 256 makes a
determination of the health of each Data Processor in an APC cluster,
such as APC cluster 64 illustrated in FIG. 2 and APC cluster 181 in
FIG. 6. This information is then shared with Deployment Generator 252,
where it is used in determining the system’s task deployment from
Deployment Plan 250.
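As a minimal illustrative sketch of how predefined policies might turn the collected signals into a per-processor health determination, the following Python fragment may be considered. The names NodeSignals, HealthPolicy and assess_health, the three health states, and all limit values are assumptions introduced here for illustration only; they are not taken from the disclosed implementation.

```python
from dataclasses import dataclass


@dataclass
class NodeSignals:
    """Signals collected for one Data Processor (hypothetical layout)."""
    last_heartbeat: float        # time of the most recent heartbeat, seconds
    output_mismatches: int = 0   # redundant output consistency mismatches
    watchdog_timeouts: int = 0   # watchdog expirations observed


@dataclass(frozen=True)
class HealthPolicy:
    """Predefined policy limits; the default values are placeholders."""
    heartbeat_timeout: float = 2.0
    max_mismatches: int = 3
    max_watchdog_timeouts: int = 1


def assess_health(signals: NodeSignals, policy: HealthPolicy, now: float) -> str:
    """Classify one node as 'healthy', 'suspect' or 'failed'."""
    if now - signals.last_heartbeat > policy.heartbeat_timeout:
        return "failed"      # heartbeat has gone silent for too long
    if signals.watchdog_timeouts > policy.max_watchdog_timeouts:
        return "failed"      # repeated watchdog expirations
    if signals.output_mismatches > policy.max_mismatches:
        return "suspect"     # replica outputs disagree too often
    return "healthy"
```

A deployment generator could then restrict task placement to nodes reported as "healthy".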

History Database 

Although reacting to immediate sensory input may be 
adequate for certain applications, the ability to predict near 
future threats to an EAFTC system provides certain advan- 
tages. In particular, adapting fault tolerance to address antici- 
pated threats reduces an exposure of the system to faults. 
History Database 248 is a component of a predictive filter 
implemented in Alert Level Generator 244. As just one 
example, sensor measurements from a previous spacecraft 
orbit may be maintained in History Database 248 and subse- 
quently retrieved by Alert Level Generator 244 for use by 
Deployment Generator 252. 

Alert Level Generator 

The process of evaluating an environmental threat to an EAFTC system is
implemented in Alert Level Generator 244. Given the current sensory
input received from spacecraft 232 and/or SEU Alarm 234, History
Database 248, and a set of system specific thresholds, Alert Level
Generator 244 outputs a discrete threat level 245 for the EAFTC system.
An important algorithm of Alert Level Generator 244 is an Adaptive
Linear Predictive Filter. This Adaptive Linear Predictive Filter
generates a particle flux prediction. Based on this particle flux
prediction, a series of user defined thresholds may be evaluated to
determine a current system alert level to be used by Deployment
Generator 252 in determining the EAFTC system’s process deployment.
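The prediction-then-threshold flow described above may be sketched as follows. The text does not state which adaptation rule the Adaptive Linear Predictive Filter uses; this sketch assumes a normalized least-mean-squares (NLMS) update, which is one common choice. The class NlmsPredictor, the function alert_level, the filter order, and the step size are all hypothetical.

```python
from typing import List, Sequence


class NlmsPredictor:
    """One-step-ahead linear predictor adapted with normalized LMS.

    Assumption: NLMS stands in for the unspecified adaptation rule.
    """

    def __init__(self, order: int = 4, step: float = 0.5, eps: float = 1e-9):
        self.weights: List[float] = [0.0] * order
        self.history: List[float] = [0.0] * order  # most recent sample first
        self.step = step
        self.eps = eps  # avoids division by zero on an all-zero history

    def predict(self) -> float:
        """Predicted next flux sample from the stored history."""
        return sum(w * x for w, x in zip(self.weights, self.history))

    def update(self, measured: float) -> float:
        """Adapt on the newest flux sample and return the next prediction."""
        error = measured - self.predict()
        energy = sum(x * x for x in self.history) + self.eps
        gain = self.step * error / energy
        self.weights = [w + gain * x for w, x in zip(self.weights, self.history)]
        self.history = [measured] + self.history[:-1]
        return self.predict()


def alert_level(predicted_flux: float, thresholds: Sequence[float]) -> int:
    """Count how many thresholds the prediction meets or exceeds (0 = lowest)."""
    return sum(1 for t in thresholds if predicted_flux >= t)
```

Under the same assumptions, measurements retained from a previous orbit could be replayed through update() before live samples arrive, so that the filter starts from adapted weights rather than zeros.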

Deployment Plan 

The on-line behavior of an EAFTC controller may vary 
based on a target environment, system level requirements, 
target application, target system architecture, and other 
implementation specific factors. This application specific 
behavior may be captured as a user defined parameter set. In 
particular, the Deployment Plan describes the desired system 
dependability for a given spacecraft position, threat level, and 
time. The Deployment Plan may be defined by the require- 
ments of each individual application process. 
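One way to picture such a user defined parameter set is as a small data structure that records, per application process, the redundancy wanted at each alert level. The sketch below is an assumption made for illustration: the names TaskRequirement and DeploymentPlan, the fields, and the fallback rule are hypothetical, and spacecraft position and time are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TaskRequirement:
    """Per-application requirement: alert level -> number of replicas."""
    name: str
    replicas_by_alert: Dict[int, int]

    def replicas(self, alert: int) -> int:
        # Use the nearest defined level at or below the current alert;
        # default to a single, unreplicated instance if none is defined.
        defined = [lvl for lvl in self.replicas_by_alert if lvl <= alert]
        return self.replicas_by_alert[max(defined)] if defined else 1


@dataclass
class DeploymentPlan:
    """User defined parameter set: thresholds plus task requirements."""
    flux_thresholds: Tuple[float, ...]   # ascending, system specific
    tasks: Tuple[TaskRequirement, ...]

    def replicas_needed(self, alert: int) -> Dict[str, int]:
        return {t.name: t.replicas(alert) for t in self.tasks}
```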

Deployment Generator 

Once the system threat level has been assessed, Deployment Generator
252 acts to counter the threat. Given a particular Deployment Plan 250,
target system health 262, and alert level 245, Deployment Generator 252
produces a new system deployment. The process of generating a new
deployment is primarily based on determining a lowest cost distribution
of application processes (including number of replicas) across
available target resources. The generated deployment is then sent to
each node in a cluster where local actions implemented by Fault
Tolerant Node software fulfill the deployment requests. Specifically,
in one arrangement, Fault Tolerant Node collaborates with RP
Middleware, as discussed in greater detail below, to deploy fault
tolerance as requested.
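The placement step may be illustrated with a simple greedy heuristic. This is a sketch under stated assumptions, not the disclosed algorithm: a greedy rule does not guarantee a lowest cost distribution, the cost model (a fixed per-node cost accumulated as load) is a placeholder, and the function name generate_deployment is hypothetical.

```python
from typing import Dict, List


def generate_deployment(replicas_needed: Dict[str, int],
                        node_cost: Dict[str, float],
                        healthy: Dict[str, bool]) -> Dict[str, List[str]]:
    """Place each task's replicas on distinct healthy nodes, cheapest first.

    Greedy heuristic: nodes are ranked by the cost accumulated so far,
    with ties broken by node name so the result is deterministic.
    """
    load = {node: 0.0 for node, ok in healthy.items() if ok}
    deployment: Dict[str, List[str]] = {}
    for task, count in sorted(replicas_needed.items()):
        if count > len(load):
            raise ValueError(f"not enough healthy nodes for {task}")
        ranked = sorted(load, key=lambda n: (load[n], n))
        chosen = ranked[:count]          # replicas never share a node
        for node in chosen:
            load[node] += node_cost[node]
        deployment[task] = chosen
    return deployment
```

Keeping the replicas of one task on distinct nodes is the design choice that matters here, since replicas sharing a node would fail together.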

Configuration Controllers 

CPU Configuration Controller 258 is designed to interface with a
particular target system 236 and provide process deployment 264. Where
more than one Configuration Controller is implemented and given a new
deployment, each Configuration Controller generates the low-level
signals to effect required changes in a targeted system. In a preferred
arrangement, two Configuration Controller types are implemented. The
first Configuration Controller is responsible for interaction with APC
nodes operating in microprocessor mode. The second Configuration
Controller interacts with APC nodes operating in custom processor mode.

Reliable Platform (“RP”) Middleware 

The role of WW Technology’s RP in the overall EAFTC solution is that of
Software Implemented Fault Tolerance (SIFT). SIFT is a fault tolerant
technique that relies on software to provide redundancy at the process
level. (See, e.g., Daniel P. Siewiorek and Robert S. Swarz, Reliable
Computer Systems Design and Evaluation, 3rd edition, MA: AK Peters
Ltd., 1998, herein entirely incorporated by reference and to which the
reader is directed for further information.) The RP manages the fault
tolerance of applications and services distributed across clusters of
processors by establishing a consistent framework and common context in
which the system operates.

In one preferred arrangement, RP consists of a set of ser- 
vices that facilitate the implementation of reliable systems 
through the dependable management of redundant/replicated 
resources. RP addresses the needs of composing systems utilizing COTS
hardware and software components, as it offers a software based
solution that provides transparent Fault Detection, Isolation and
Removal (“FDIR”) services, enabling hosted applications to provide
uninterrupted delivery of service in the presence of faults.

FIG. 8 illustrates an exemplary block diagram 300 of reliable
middleware that may be utilized with the EAFTC controller illustrated
in FIG. 1. FIG. 8 depicts the RP and its relationship to other software
elements of the system. The main RP framework components are described
as follows.

Local Services 302 are services that are local to each pro- 
cessor in the distributed system. These services provide local 
functionality required for a processor to perform useful work 
in a cluster. Examples of these types of services include but 
are not limited to networking, local scheduling, timing, and 
inter process communications. 

Cluster Synchronization 304 establishes a dependable dis- 
tributed time base that is consistent across the entire system. 
This service is based on a message passing technique and uses 
local physical clocks at each component to form a logical 
system clock. Preferably, Cluster Synchronization 304 is 
scalable and efficiently establishes the time base across pro- 
cessors. This time base may be used as a backbone for sched- 
uling distributed operations across the cluster. 
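The text does not specify how the logical system clock is derived from the exchanged local clock readings. As a sketch only, the fragment below assumes a fault-tolerant averaging rule, a well-known approach in which the most extreme readings are discarded before averaging. The function names and the max_faulty parameter are hypothetical.

```python
from typing import Sequence


def fault_tolerant_average(readings: Sequence[float], max_faulty: int) -> float:
    """Drop the max_faulty lowest and highest readings, average the rest."""
    if len(readings) <= 2 * max_faulty:
        raise ValueError("need more than 2 * max_faulty readings")
    ordered = sorted(readings)
    kept = ordered[max_faulty:len(ordered) - max_faulty]
    return sum(kept) / len(kept)


def clock_correction(local_clock: float,
                     peer_clocks: Sequence[float],
                     max_faulty: int = 1) -> float:
    """Offset to add to the local clock to reach the logical system time.

    peer_clocks are the clock values received in messages from other
    processors during the current synchronization round.
    """
    logical = fault_tolerant_average([local_clock, *peer_clocks], max_faulty)
    return logical - local_clock
```

With readings of 100.0, 100.2, 99.8 and a wildly wrong 500.0, the outlier is discarded and the local correction is a small fraction of a second rather than a jump of hundreds.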

System Configuration Services 306 establish and control 
the configuration of the cluster. The cluster configuration 
comprises the system physical resources and logical capabili- 
ties. The System Configuration Service interacts directly with 
the EAFTC Fault Tolerant Node component. This in turn 
communicates with Fault Tolerant Controller. EAFTC con- 
troller sends its generated deployment via Fault Tolerant Con- 
troller/Node to each processor’s System Configuration Ser- 
vice where deployment changes are finally effected. 

System Monitoring Services 314 supply the system with
an ability to dynamically assess the health of the cluster and 
localize failed processors and application processes. Assess- 
ments are made with a cluster wide perspective using distrib- 
uted decision-making and integrated monitoring information 
from across the cluster. Failure notifications from this service 
may be forwarded to the EAFTC Health Monitor via the Fault 
Tolerant Controller/Node components. 

Process Group Management supports one preferred approach for enhancing
the availability and dependability of payload applications, which
relies on replication. The set of replicated instances is managed as a
“process group.” This is a peer-to-peer entity in which the support
services of each replica constantly check the performance/behavior of
the local replica against that of its remote peers.
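The peer comparison within a process group may be sketched as a majority check over replica outputs. The comparison rule used by RP is not given in the text; strict-majority voting and the function name find_divergent are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_divergent(outputs: Dict[str, bytes]) -> Tuple[bytes, List[str]]:
    """Return the majority output and the replicas that disagree with it.

    outputs maps a replica name to the result it produced this cycle.
    Raises ValueError when no value is held by a strict majority.
    """
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError("no strict majority among replicas")
    divergent = sorted(name for name, out in outputs.items() if out != value)
    return value, divergent
```

For example, with three replicas of which one returns a different result, the call yields the agreed value and names the single divergent replica, which could then be reported as a failure notification.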

Scheduling provides a scheduling mechanism that is available to the
hosted applications. This mechanism initially provides indications to
application processes as to when to perform their execution cycles and
when interaction with other support services may be performed. This
scheduling mechanism is based on the common time base established
through cluster synchronization. Operations controlled by this
scheduling service can be coordinated in time across all elements of
the cluster.

Data Integrity 308 provides consistent data sets across 
replicas. A deviation from this consistent data by a replica is 
to be interpreted as an error by that replica. This capability 
allows hosted applications to expose internal state data facili- 
tating warm starts of additional resources as they come on- 
line. Additional replicas may join an established group by 
adopting the internal state of the existing replicas. 
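The warm start described above may be sketched as a state copy followed by a consistency check. The Replica class, the use of a SHA-256 digest over JSON-serialized state, and the function names are assumptions for illustration; the text does not say how RP represents or compares internal state.

```python
import copy
import hashlib
import json
from typing import Iterable, Optional


def state_digest(state: dict) -> str:
    """Stable fingerprint of a replica's exposed internal state."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class Replica:
    """Hypothetical replica that exposes its internal state as a dict."""

    def __init__(self, name: str, state: Optional[dict] = None):
        self.name = name
        self.state = state if state is not None else {}

    def warm_start_from(self, peer: "Replica") -> None:
        # Adopt a private copy so later updates do not alias the peer.
        self.state = copy.deepcopy(peer.state)


def is_consistent(replicas: Iterable[Replica]) -> bool:
    """True when every replica holds the same data set."""
    return len({state_digest(r.state) for r in replicas}) <= 1
```

A replica whose digest differs from that of its peers would, under this sketch, be the one treated as being in error.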

RP 312 offers its services in a flexible manner, supporting a
distribution of applications that is not necessarily tied to the
physical realization of the cluster. In one preferred arrangement, RP
utilizes a clustering approach to manage a cluster processor.
Application replicas are hosted on each RP-Enabled resource via RP
Interface (RPI). This renders the application “unaware” of the fact
that it has been replicated, or to what extent it has been replicated.
RP works in the background to monitor application behavior and
recognize when a fault has resulted in application divergence. RP not
only provides dependability to hosted applications, but RP is
in-and-of itself dependable, capitalizing internally on the same
techniques and properties conveyed to hosted applications.

The EAFTC system combines a set of innovative technologies to enable a
system and/or method for the efficient use of high performance COTS
processors while these processors operate in generally harsh space
environments. An enhanced level of performance may be achieved while
also maintaining a certain required system availability. For example,
FIG. 9 illustrates one example 400 of applying the EAFTC system
illustrated in FIG. 1. On the left side of FIG. 9, a particular
satellite’s orbit 402 is illustrated as comprising a set of four
regions. These regions comprise a first region 404, a second region
406, a third region 408, and a fourth region 410. Each region 404, 406,
408, and 410 has associated therewith a varied radiation environment.
Although only four regions and four radiation environments are
illustrated, those of ordinary skill in the art will recognize that
more or fewer than four regions may be employed.

As the EAFTC system travels through orbit from one region to the next
region, the system collects measurements of the SEU Alarm response to
the radiation. This SEU Alarm response 414 is illustrated as a function
of orbit position (404a-410a), fluctuating as the space borne craft
traverses from one region to the next. The EAFTC system dynamically
creates regions based on these measurements and based in part on the
on-board processing system’s sensitivity to radiation.

As the EAFTC system enters and leaves a particular region, the system
dynamically configures the fault tolerance to match the environment.
The overall result is an increase in the system’s performance as
depicted by curve 420. Curve 420 represents the EAFTC system’s
instructions per unit of power, in this case Millions of Instructions
Per Second per Watt (“MIPS/Watt”). When compared to a conventional
system designed for a worst case scenario, illustrated as a first
alternative line 422, the average performance of an EAFTC system,
illustrated as a black dotted line 424, will be higher. Though the
overall performance gain depends on a particular orbit and the on-board
processing system’s sensitivity and adaptability, EAFTC provides a
solution that is at least as good as, if not better than, the
conventional approach.
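The comparison between the adaptive average and the worst case design may be made concrete with a time-weighted average. All figures below are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from FIG. 9, and the function name average_efficiency is an assumption.

```python
from typing import List, Tuple


def average_efficiency(segments: List[Tuple[float, float]]) -> float:
    """Time-weighted mean efficiency over an orbit.

    segments holds (fraction_of_orbit, mips_per_watt) pairs, one per region.
    """
    total = sum(fraction for fraction, _ in segments)
    return sum(fraction * eff for fraction, eff in segments) / total


# Hypothetical four-region orbit: benign regions run with little
# redundancy (high MIPS/Watt); the harshest region runs fully replicated.
adaptive_orbit = [(0.4, 300.0), (0.3, 200.0), (0.2, 150.0), (0.1, 100.0)]

# A conventional design is fixed at the worst case configuration everywhere.
worst_case = min(eff for _, eff in adaptive_orbit)
adaptive_mean = average_efficiency(adaptive_orbit)
```

With these placeholder values the adaptive mean is about 220 MIPS/Watt against 100 MIPS/Watt for the fixed worst case design. If the entire orbit were as harsh as the worst region, the two would coincide, which is the sense in which the adaptive approach is never worse.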

Therefore, the EAFTC system as illustrated in FIG. 1 miti- 
gates faults, and in particular SEUs in COTS devices. Such 
fault mitigation is accomplished while also increasing the 
system’s overall efficiency and capacity. EAFTC system 10 
accomplishes this feat by optimally applying fault tolerance
over the life of the mission as demanded by the task criticality 
and environmental measurements. 

The proposed EAFTC system results in a novel technology for on-board
payload processing. The disclosed EAFTC is a COTS based computing
system architecture and associated system control algorithms that
together provide a reliable on-board processing platform. Applicants’
EAFTC system senses an environment, assesses the fault threat presented
by the environment, and adjusts the processing system’s fault tolerance
to thereby effectively mitigate certain threats presented by the
environment. In this manner, EAFTC optimally employs fault tolerance
based on historical and environmental conditions. EAFTC can therefore
also increase the overall system efficiency, in terms of unit of
computations per Watt.

Exemplary embodiments of the present invention have been described.
Those skilled in the art will understand, however, that changes and
modifications may be made to these embodiments without departing from
the true scope and spirit of the present invention, which is defined by
the claims.

We claim: 

1. A method of adapting fault tolerant computing, said 
method comprising the steps of: measuring an environmental 
condition representative of an environment; analyzing an on- 
board processing system for sensitivity to said measured 
environmental condition; and determining whether to recon- 
figure a fault tolerance of said on-board processing system, 
based at least in part on said measured environmental condi- 
tion, with a controller comprising: an alert level generator 
configured to evaluate a potential environmental threat based 
on said measured environmental condition, the alert level 
generator comprising an adaptive linear predictive filter that 
generates a particle flux prediction; a deployment plan that is 
user definable to describe one or more application tasks and 
system thresholds; and a deployment generator configured to 
receive data from the alert level generator and the deployment 
plan. 

2. The method of claim 1 further comprising the step of reconfiguring
said fault tolerance of said on-board processing system based in part
on said measured environmental condition.

3. The method of claim 2 wherein said fault tolerance of 
said on-board processing system is reconfigured to match 
said environment. 

4. The method of claim 2 wherein said fault tolerance of 
said on-board processing system is reconfigured based in part 
on historical data. 

5. The method of claim 1 wherein said step of measuring 
said environmental condition occurs during an orbit position 
of said on-board processing system. 

6. The method of claim 1 wherein said on-board processing 
system is included in a space borne asset. 

7. The method of claim 1 wherein said step of measuring 
environmental conditions comprises the step of detecting 
radiation conditions. 

8. The method of claim 1 wherein said step of measuring 
said environmental condition comprises monitoring flux of 
high-energy particles that cause single event upsets. 

9. The method of claim 1 further comprising providing an 
alarm suite, said suite providing an alarm signal representa- 
tive of an environment threat. 

10. The method of claim 1 further comprising the step of 
collecting measurements of an alarm in response to said envi- 
ronmental condition on a space borne spacecraft. 

11. The method of claim 10 wherein said step of collecting
measurements of an alarm in response to said environmental condition
further comprises the step of sensing proton and heavy-ion fluxes in
space.

12. The method of claim 10 further comprising the step of generating a
particle flux prediction based in part on said collected measurements
of said environmental condition.

13. The method of claim 1 further comprising the step of 
operating said on-board processing system in a plurality of 
operational modes, said on-board processing system com- 
prising a COTS processor. 

14. The method of claim 1 wherein said step of determining whether to
reconfigure said fault tolerance of said on-board processing system
further comprises the step of predicting a near future threat of said
system.

15. A system for environmentally adaptive fault tolerant 
computing, said system comprising: a sensor that senses a 
characteristic of a dynamic environment and generates an 
output signal based on said characteristic; a system configu- 
ration controller that is executable by a processor and that 
receives said output signal, said controller assessing a potential
environmental threat to an availability of said system
based at least in part on said output signal, said controller 
comprising: an alert level generator configured to evaluate the 
potential environmental threat, the alert level generator com- 
prising an adaptive linear predictive filter that generates a 
particle flux prediction; a deployment plan that is user defin- 
able to describe one or more application tasks and system 
thresholds; and a deployment generator configured to receive 
data from the alert level generator and the deployment plan; 
and a computing device that receives an input from said 
controller; wherein a configuration of said computing device 
is adapted to effectively mitigate said potential environmental 
threat to the availability of said system. 

16. The system of claim 15 wherein said sensor comprises 
an environmental sensor suite. 

17. The system of claim 15 wherein said sensor senses 
radiation conditions. 

18. The system of claim 15 wherein said computing device 
is a payload computer system. 

19. The system of claim 15 wherein said controller assess- 
ing said potential environmental threat to said availability of 
said system is based in part on said input and also in part on 
previously measured environmental conditions. 

20. The system of claim 15 wherein said controller oper- 
ates in a plurality of operational modes. 

21. The system of claim 15 wherein said controller com- 
prises a COTS processor. 
22. A system for environmentally adaptive fault tolerant computing, the
system comprising: an environmental sensor suite comprising: a
plurality of sensors including at least one single event upset alarm
that provides an alarm signal representative of a potential
environmental threat; a controller that is executable by a processor
and in operative communication with the environmental sensor suite, the
controller comprising: an environmental server configured to receive
sensory input signals from the plurality of sensors; a history database
configured to contain prior sensor measurements; an alert level
generator configured to evaluate the potential environmental threat
based on the sensory input signals and the prior sensor measurements,
the alert level generator comprising an adaptive linear predictive
filter that generates a particle flux prediction; a deployment plan
that is user definable to describe one or more application tasks and
system thresholds; a deployment generator configured to receive data
from the alert level generator and the deployment plan; and a computer
health monitor in operative communication with the deployment
generator; and a computer comprising at least one data processor in
operative communication with the controller and configured to
effectively mitigate the potential environmental threat; wherein the
deployment generator of the controller is configured to evaluate the
potential environmental threat to the computer based on data from the
deployment plan, the computer health monitor, and the alert level
generator, and send a deployment signal to the computer to counter the
potential environmental threat.



